Segmentation of Chinese Discourse in Content-Based Information Retrieval

Authors

  • Samuel W. K. Chan
  • Benjamin Ka-Yin T'sou
Abstract

In this paper, we present a novel approach to automatic discourse segmentation that does not require full semantic understanding. To analyse the textual bonds and determine the degree of coherence that a discourse may exhibit, we first represent the tremendous diversity of textual relations as a discourse network. A set of mutual linguistic constraints that largely determines the similarity of meaning among lexical items is encoded. Topic boundaries in a discourse are identified through a computational method that identifies segment clusters from a higher-order structure in the discourse network. Segmentation is thus regarded as a process of identifying the shifts from one segment cluster to another. Experimental results show that our formulation is capable of detecting topic shifts in texts. Comparison with a related method demonstrates that the combination of constraints is closely related to the topic boundaries among textual segments. Evaluation using recall and precision shows the effectiveness of our approach on a collection of Chinese newswire articles.
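The abstract reports evaluation by recall and precision but does not say how boundaries are matched. As a minimal sketch, assuming exact-match scoring over boundary positions (an assumption, not necessarily the authors' procedure), the two measures could be computed as follows. The function name `boundary_precision_recall` and the sample boundary lists are hypothetical and chosen only for illustration.

```python
def boundary_precision_recall(predicted, reference):
    """Exact-match precision and recall over sets of boundary positions.

    Assumption: a boundary is the index of the sentence after which a new
    topic segment starts, and a prediction counts as correct only if it
    coincides exactly with a reference boundary.
    """
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    # Guard against empty inputs to avoid division by zero.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical data, not taken from the paper.
reference = [3, 7, 12]
predicted = [3, 8, 12, 15]
precision, recall = boundary_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: the predicted boundary at 8 earns no credit even though it is one sentence away from the reference boundary at 7. Tolerance-based variants relax this, but whether the paper uses one is not stated in the abstract.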

Similar resources

Topic Identification in Chinese Discourse Based on Centering Model

In this article we are concerned with identifying the topics of utterances in texts, which are discourse elements reflecting the links between a sentence and its context. The information carried by the topics can contribute to a number of natural language processing applications, such as information retrieval, text categorization and discourse segmentation. However, the phenomenon of...

Chinese Spam Filtering Based On Back-Propagation Neural Networks

As email becomes an important means of communication on the network, spam is increasing every day. This paper describes a new filtering model based on email content that uses Back-Propagation Neural Networks (BPNN). For Chinese email, it uses the Natural Language Processing & Information Retrieval Sharing Platform (NLPIR) system to perform Chinese word segmentation. The simula...

Prosody-based Topic Segmentation for Mandarin Broadcast News

Automatic topic segmentation, separation of a discourse stream into its constituent stories or topics, is a necessary preprocessing step for applications such as information retrieval, anaphora resolution, and summarization. While significant progress has been made in this area for text sources and for English audio sources, little work has been done in automatic, acoustic feature-based segment...

Assessing Prosodic And Text Features For Segmentation Of Mandarin Broadcast News

Automatic topic segmentation, separation of a discourse stream into its constituent stories or topics, is a necessary preprocessing step for applications such as information retrieval, anaphora resolution, and summarization. While significant progress has been made in this area for text sources and for English audio sources, little work has been done in automatic segmentation of other languages...

Chinese Discourse Segmentation Based on Punctuation Marks

This paper addresses Chinese discourse segmentation based on punctuation marks. In particular, we propose various kinds of lexical, syntactic, position and punctuation features to train classifiers for Chinese discourse segmentation. Experimental results on CDTB (Chinese Discourse Treebank) show that our method based on punctuation marks is appropriate for Chinese discourse segmentation with 89.2%...

Publication date: 2000